In this project, we analyze 3,202 stock price and volume data time series traded on the NASDAQ exchange between May 30th and October 30th, 2019. This date range was selected for its distance from significant biological and political disruption to the markets, which can both introduce artificial seasonality and increased random variation into forecasts. Data was sourced as comma-separated values through API from AlphaVantage.
Because of the time constraints involved with directly analyzing each stock’s realization, we developed a loop to process through each file, perform a linear model on each stock and where the slope for price is positive with a slope greater than 0.04 and the price is within an affordable range - between 5 and 50 dollars per share - we then select that stock to assess if the spectral density indicates any wandering behavior based on a peak at zero and no additional peaks thereafter. Because of this wandering, the sample ACFs were also expected to damp exponentially, indicative of non-stationary behavior, potentially trending behavior. This method allowed us to identify seven potentially ideal stocks for signal-plus-noise modeling with postive, profitable trending. From these 7 stocks, we selected one and modeled it. We chose this stock to model based on the stationarity of the noise around the signal.
plotts.sample.wge(df$low, trunc=25, arlimits=T)
files = list.files(path='../Time-Series-Stocks', pattern='*.csv')
for(file in files){
actualFile <- paste0('../Time-Series-Stocks/',file)
df <- read.csv(actualFile, header=T, strip.white=T)
# run linear regression to get the signal (average).
t = seq(1, nrow(df),1)
fit = lm(df$low~t)
# get the frequency values from the spectral density in the Parzen Window (we want them to wander without season; just trend)
dbz <- plotts.sample.wge(df$low)[4] # plotting sample plots to see the stocks while they process
# if the linear coefficient (deterministic signal) for the price is positive and the price is between 5 and 50 (affordable):
if(fit$coefficients[2] > 0.04 && df$low[nrow(df)] > 5 && df$low[nrow(df)] < 50){
for(i in 1:(length(dbz$dbz)-1)){
# if the realization is wandering (based on spectral density):
if(dbz$dbz[i] > dbz$dbz[i+1]){
write.table(df$symbol, './models_aic_less_than_0.csv', append=T)
}
}
}
}
Because we have a high, low, open, and close price, we wanted to visually inspect the relationship between these prices across each data point. Through this visual inspection, we noticed differing amounts of variation within each price sets across all data points. As a result, we decided to engineer two new features representing the difference between the high and low and open and close prices. These two new features allowed our multivariate modeling to ingest additional insights into the dynamic relationships betweeen prices in our data.
urlfile = "https://raw.githubusercontent.com/PaulAdams4361/Time-Series-Project/master/NASDAQ_Daily_ACGL.csv"
df <- read_csv(url(urlfile))
## Parsed with column specification:
## cols(
## times = col_date(format = ""),
## open = col_double(),
## high = col_double(),
## low = col_double(),
## close = col_double(),
## volume = col_double(),
## symbol = col_character(),
## market = col_character()
## )
df <- data.frame(Date=df$times, coredata(df[,2:5]))
fig <- df %>% plot_ly(x = ~Date, type="candlestick",
open = ~open, close = ~close,
high = ~high, low = ~low)
fig <- fig %>% layout(title = "Candlestick Chart for ACGL")
fig
Candlestick Chart for ACGL
After anlayzing the candlestick plot, we decided to use the low price as the target feature of the model. The reason we chose this is because when a stock is trending up, the low price is quick to point this out because the moving average will often rise above the low price, especially for stronger uptrending behavior. This further provides insight into potential investment profitability.
Below are functions for data portability throughout the project. These are the two source data sets we will use.
stock_data <- function(x){
urlfile="https://raw.githubusercontent.com/PaulAdams4361/Time-Series-Project/master/NASDAQ_Daily_ACGL.csv"
df <- read_csv(url(urlfile))
df$volume <- (df$volume/10000)
HiLo <- df$high - df$low
HiClo <- df$high - df$close
OpHi <- df$open - df$high
OpClo <- df$open - df$close
OpLo <- df$open - df$low
CloLo <- df$close - df$low
varianceRatio <- (df$open - df$close) / (df$high - df$low) * 100
spread <- df$high - df$low
df <- data.frame(cbind(df, varianceRatio, HiLo, HiClo, OpHi, OpClo, OpLo, CloLo))
return(df)
}
df <- stock_data()
## Parsed with column specification:
## cols(
## times = col_date(format = ""),
## open = col_double(),
## high = col_double(),
## low = col_double(),
## close = col_double(),
## volume = col_double(),
## symbol = col_character(),
## market = col_character()
## )
stock_data_trans <- function(x){
urlfile="https://raw.githubusercontent.com/PaulAdams4361/Time-Series-Project/master/NASDAQ_Daily_ACGL.csv"
df <- read_csv(url(urlfile))
df2 <- df[1:(nrow(df)-1),]
df2$volume <- (df2$volume/10000)
lowww <- artrans.wge(df$low, phi.tr=1)
HiLo <- df2$high - lowww
HiClo <- df2$high - df2$close
OpHi <- df2$open - df2$high
OpClo <- df2$open - df2$close
OpLo <- df2$open - lowww
CloLo <- df2$close - lowww
varianceRatio <- (df2$open - df2$close) / (df2$high - lowww) * 100
spread <- df2$high - lowww
df2$low <- lowww
dfnew <- data.frame(cbind(df2, varianceRatio, HiLo, HiClo, OpHi, OpClo, OpLo, CloLo))
return(dfnew)
}
dftrans <- stock_data_trans()
## Parsed with column specification:
## cols(
## times = col_date(format = ""),
## open = col_double(),
## high = col_double(),
## low = col_double(),
## close = col_double(),
## volume = col_double(),
## symbol = col_character(),
## market = col_character()
## )
Signal-Plus-Noise Model
In this Signal-Plus-Noise model, we perform a hypothesis test on the linear trend to identify using OLS if the trend is possibly deterministic. After positive confirmation from OLS, we then tested this with the Cochrane-Orcutt \(AR(1)\) based hypothesis test, which accounts for serial correlation in determining slope significance. This test also confirmed the signal is a deterministic trend.
Next, we removed the residuals from the trend line and built a model for this data. We then tested the residuals for white noise variance using the Ljung-Box test, which indicated there is not enough evidence to consider the residuals to be more than white noise. Because of the stationarity of the residuals, we were able to estimate a model using the linear slope and adding to it the variation of the residuals.
Forecasting error was measured in terms of Average Squared Error using the last trading week’s data points, for which there were five. The ASE was 0.1654078.
df <- stock_data()
## Parsed with column specification:
## cols(
## times = col_date(format = ""),
## open = col_double(),
## high = col_double(),
## low = col_double(),
## close = col_double(),
## volume = col_double(),
## symbol = col_character(),
## market = col_character()
## )
# take a sample of the data, analyze
# plotts.sample.wge(df$low, arlimits=T)
####################
# Signal-Plus-Noise
####################
x = df$low
n = length(x)
t = 1:n
fit = lm(x ~ t)
summary(fit) # there appears to be deterministic trend based on OLS; the p-value is significant so reject the null that it is not
##
## Call:
## lm(formula = x ~ t)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.10389 -0.52398 -0.02693 0.34676 1.35981
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.855438 0.123444 282.36 <2e-16 ***
## t 0.072922 0.002122 34.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6126 on 98 degrees of freedom
## Multiple R-squared: 0.9234, Adjusted R-squared: 0.9226
## F-statistic: 1181 on 1 and 98 DF, p-value: < 2.2e-16
# deterministic. The null argues that any present trend is random that will eventually traverse such a pattern that the realization
# will continue to approximate around the mean.
# Because OLS is not robust to non-stationarity, we apply the Cochrane-Orcutt test to also test the beta coefficient slope of time
# using an aproach that fits an AR(1) model to the residuals:
cfit = cochrane.orcutt(fit) # to confirm with chochrane-orcutt
summary(cfit) # Cochran-Orcutt also provides a significant p-value. Based on this, we assume the slope is not equal to zero and
## Call:
## lm(formula = x ~ t)
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.0451902 0.3933902 89.085 < 2.2e-16 ***
## t 0.0698107 0.0063528 10.989 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.366 on 97 degrees of freedom
## Multiple R-squared: 0.5546 , Adjusted R-squared: 0.55
## F-statistic: 120.8 on 1 and 97 DF, p-value: < 9.97e-19
##
## Durbin-Watson statistic
## (original): 0.39519 , p-value: 1.096e-16
## (transformed): 1.49316 , p-value: 3.732e-03
# therefore, there is deterministic that justifies fitting a signal-plus-noise model instead of an ARMA(p,q). However, ARMA(p,q)
# will be fitted later for comparison.
#x.z = x - fit$coefficients[1] - fit$coefficients[2]*t # derive residuals
x.z = fit$residuals # derive residuals (from the regression line)
ar.z = aic.wge(x.z, p=0:6, type="aic") # find a model to use for approximating the residuals. NOTE: (sigmaHAT_a)^2 = 0.1177843
# ar.z$p is the order p (aic selects p=2 where q=0, as does the bic)
# Transform the stock prices by the autoregressive coefficients of the fitted residuals from the linear regression model.
# Remove phi from residuals to remove serial correlation and allow us to model
y.trans = artrans.wge(df$low, phi.tr=ar.z$phi)
Signal-Plus-Noise Model
# also, transform the predictor variable (time) by the autogregressive coefficeints of the fitted residuals as well
t.trans = artrans.wge(t, phi.tr=ar.z$phi)
Signal-Plus-Noise Model
# Finally, regress the newly transformed stock prices (Y-HAT_t) on the transformed time (T-HAT_t)using ordinary least squares
fitco = lm(y.trans ~ t.trans)
summary(fitco) # check to see if the transformed beta coefficient for the slope is still significant
##
## Call:
## lm(formula = y.trans ~ t.trans)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.97241 -0.18510 0.01497 0.20566 0.70938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.320641 0.074125 125.74 <2e-16 ***
## t.trans 0.070037 0.004634 15.11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3452 on 96 degrees of freedom
## Multiple R-squared: 0.7041, Adjusted R-squared: 0.701
## F-statistic: 228.4 on 1 and 96 DF, p-value: < 2.2e-16
# Evaluating the residuals after Cochrane-Orcutt:
plotts.wge(fitco$residuals)
Signal-Plus-Noise Model
acf(fitco$residuals) # residuals appear to be white noise
Signal-Plus-Noise Model
ljung.wge(fitco$residuals) # there is not enough evidence based on the ljung-box test to reject the null hypothesis. Therefore,
## Obs -0.02329353 0.03696201 -0.0148023 -0.1264051 0.02979692 0.1170724 0.0121294 -0.04358782 -0.004767952 -0.09739137 0.04153028 -0.05785741 0.01704484 -0.08708388 -0.04190725 -0.00537386 0.008676478 -0.04774231 -0.123505 0.02173201 -0.01302386 -0.09454542 -0.1174163 -0.0818482
## $test
## [1] "Ljung-Box test"
##
## $K
## [1] 24
##
## $chi.square
## [1] 12.52529
##
## $df
## [1] 24
##
## $pval
## [1] 0.9733174
# we cannot assume that the residuals are not white noise.
# Final Signal-Plus-Noise Model: X_t = 34.855438 + 0.072922*t + Z_t, (sigmaHAT_a)^2 = 0.1177843 summary(fit = lm(x ~ t))
# creates the coefficients
ar.z$phi
## [1] 1.0533533 -0.3193699
# (1 - 1.0533533*B + 0.3193699*B^2)*Z_t = a__t. ar.z$phi from AR(2) creates the coeffients and (sigmaHAT_a)^2 = 0.1177843
# BUT, TO REITERATE: Final Signal-Plus-Noise Model is X_t = 34.855438 + 0.072922*t + Z_t, (sigmaHAT_a)^2 = 0.1177843
est_mod = gen.sigplusnoise.wge(100, b0=34.855438, b1 = 0.072922, phi=ar.z$phi, vara=0.1177843)
Signal-Plus-Noise Model
Signal-Plus-Noise Model
plotts.sample.wge(est_mod)
Signal-Plus-Noise Model
## $autplt
## [1] 1.0000000 0.9651366 0.9118145 0.8565512 0.8119889 0.7768475 0.7465706
## [8] 0.7193186 0.6922062 0.6658128 0.6410263 0.6151013 0.5857824 0.5597746
## [15] 0.5307624 0.5006611 0.4660858 0.4286500 0.3901764 0.3577466 0.3377352
## [22] 0.3232818 0.3066006 0.2867248 0.2640283 0.2377298
##
## $freq
## [1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14
## [15] 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
## [29] 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42
## [43] 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.50
##
## $db
## [1] 14.3656799 10.0443945 4.2373256 0.3328415 3.7183504
## [6] -2.9293040 -4.4417351 3.3619398 -4.9979624 -6.0664768
## [11] -9.1976096 -2.4333288 -4.4143009 -9.4646269 -5.4455831
## [16] -7.8048186 -13.4622979 -6.1605324 -7.4926226 -13.0960169
## [21] -22.9914819 -16.3550408 -9.4486304 -17.4872277 -18.9786086
## [26] -12.5301829 -16.6987460 -15.5008055 -11.5993071 -17.8186056
## [31] -18.1266390 -19.1975188 -15.2008114 -21.5084173 -18.7413264
## [36] -14.4695154 -16.8957406 -18.8504054 -12.8515599 -20.4378658
## [41] -18.0737498 -19.9406019 -17.1370027 -17.0168315 -16.6613265
## [46] -14.2727801 -17.1060521 -17.8422981 -17.1279923 -26.4868302
##
## $dbz
## [1] 10.6211100 9.9096304 8.7223550 7.0687089 4.9951979
## [6] 2.6416904 0.3110044 -1.5988486 -2.9302225 -3.8882665
## [11] -4.6975926 -5.4166271 -6.0512655 -6.6709255 -7.3787573
## [16] -8.2217302 -9.1548511 -10.0826599 -10.9421629 -11.7433498
## [21] -12.5212558 -13.2727162 -13.9604085 -14.5674614 -15.1205846
## [26] -15.6523341 -16.1653392 -16.6486865 -17.1148075 -17.5949536
## [31] -18.0909271 -18.5424199 -18.8588992 -18.9968950 -18.9955537
## [36] -18.9338563 -18.8788838 -18.8729126 -18.9358535 -19.0569523
## [41] -19.1857241 -19.2516426 -19.2186147 -19.1245018 -19.0552735
## [46] -19.0825604 -19.2220228 -19.4272086 -19.6101446 -19.6834596
plotts.sample.wge(df$low) # the estimated model (above) matches to the actual model (here) on both sample ACFs and sample spectral
Signal-Plus-Noise Model
## $autplt
## [1] 1.0000000 0.9542096 0.8987323 0.8524345 0.8100849 0.7813310 0.7600881
## [8] 0.7371705 0.7123568 0.6828689 0.6476693 0.6153474 0.5864842 0.5600570
## [15] 0.5282542 0.4955941 0.4660246 0.4343222 0.4026243 0.3675647 0.3278754
## [22] 0.2900614 0.2568547 0.2354177 0.2239847 0.2178967
##
## $freq
## [1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14
## [15] 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
## [29] 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42
## [43] 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.50
##
## $db
## [1] 14.2537148 9.9541873 5.8173014 -0.5437278 -0.8235314
## [6] -1.3564258 1.4197169 -1.1176687 -7.3533813 -3.8782332
## [11] -0.5445071 -3.3199576 -5.9250727 -5.4172047 -2.9311702
## [16] -11.1750831 -6.3519213 -18.5926329 -18.7079883 -8.4217804
## [21] -9.3402655 -9.6710352 -9.7031584 -8.2549030 -14.1830462
## [26] -11.3498316 -12.9743644 -12.1531178 -16.0829501 -9.7837888
## [31] -12.9010407 -9.5737077 -12.8850744 -15.0135304 -10.3417581
## [36] -16.6716427 -13.1487825 -11.0083985 -16.2745553 -18.6304870
## [41] -17.7423184 -13.7665082 -16.0810139 -16.9046379 -13.1224062
## [46] -20.5533807 -18.2518162 -14.7903000 -19.2395748 -19.0265205
##
## $dbz
## [1] 10.6247182 9.9050329 8.7000676 7.0111987 4.8698055
## [6] 2.3951954 -0.1075597 -2.1509930 -3.4458827 -4.1937204
## [11] -4.7075065 -5.1324715 -5.5423563 -6.0346253 -6.7027509
## [16] -7.5765092 -8.5979157 -9.6409675 -10.5694523 -11.2993670
## [21] -11.8169224 -12.1628463 -12.4196726 -12.6865263 -13.0293947
## [26] -13.4394270 -13.8345951 -14.1178754 -14.2557593 -14.2960188
## [31] -14.3127453 -14.3564021 -14.4497949 -14.6054534 -14.8345314
## [36] -15.1426627 -15.5243385 -15.9631650 -16.4335783 -16.8977220
## [41] -17.3041080 -17.6055003 -17.7946958 -17.9201902 -18.0549021
## [46] -18.2464892 -18.4904595 -18.7368379 -18.9191710 -18.9861139
# density as well as on partial ACF (below):
pacf(est_mod)
pacf(df$low)
#########################################################################################################
############### Confirmation for repeated model visualization of ACF and Spectral Density ###############
#########################################################################################################
for( i in 1: 5)
{
SpecDen2 = parzen.wge(gen.sigplusnoise.wge(100, b0=34.855438, b1 = 0.072922, phi=ar.z$phi, vara=0.1177843), plot = T)
lines(SpecDen2$freq,SpecDen2$pzgram, lwd = 2, col = "red")
}
Signal-Plus-Noise Model
Signal-Plus-Noise Model
Signal-Plus-Noise Model
Signal-Plus-Noise Model
Signal-Plus-Noise Model
Signal-Plus-Noise Model
Signal-Plus-Noise Model
Signal-Plus-Noise Model
Signal-Plus-Noise Model
Signal-Plus-Noise Model
Signal-Plus-Noise Model
Signal-Plus-Noise Model
Signal-Plus-Noise Model
Signal-Plus-Noise Model
Signal-Plus-Noise Model
Signal-Plus-Noise Model
for( i in 1: 5)
{
ACF2 = acf(gen.sigplusnoise.wge(100, b0=34.855438, b1 = 0.072922, phi=ar.z$phi, vara=0.1177843), plot = T)
lines(ACF2$lag ,ACF2$acf, lwd = 2, col = "red")
}
Signal-Plus-Noise Model
Signal-Plus-Noise Model
Signal-Plus-Noise Model
Signal-Plus-Noise Model
Signal-Plus-Noise Model
Signal-Plus-Noise Model
Signal-Plus-Noise Model
Signal-Plus-Noise Model
Signal-Plus-Noise Model
Signal-Plus-Noise Model
Signal-Plus-Noise Model
Signal-Plus-Noise Model
Signal-Plus-Noise Model
Signal-Plus-Noise Model
Signal-Plus-Noise Model
#########################################################################################################
#########################################################################################################
#########################################################################################################
signoise.forecast <- fore.sigplusnoise.wge(df$low, max.p=2, n.ahead=5, limits=T, lastn=T, plot=T)
##
## Coefficients of Original polynomial:
## 1.0507 -0.3152
##
## Factor Roots Abs Recip System Freq
## 1-1.0507B+0.3152B^2 1.6667+-0.6283i 0.5614 0.0574
##
##
Signal-Plus-Noise Model
SIGNOISE_ASE = mean((df$low[(nrow(df)-5+1):nrow(df)] - signoise.forecast$f)^2)
SIGNOISE_ASE # 0.133713
## [1] 0.133713
In the ARIMA model, we selected a forecast horizon of five trading days because this timeframe completes a full business week. Models can be fully re-developed on non-trading days. However, unless there are two unit roots, ARIMA will not have enough precedential autocorrelation in the realization for a forecasted trend to continue. Consequently, it will have no option but to converge toward the mean. Because of the lack of a seasonal component, this model may not model well. However, as seen with the Signal-Plus-Noise model, the realization’s strong signal and weak noise components will work well for an Autoregressive/Moving Average model that tends toward the mean since this is what appears to be guiding the primary growth structure of the share price over time.
Because of the strong, positive signal of the price over time, the slowly dampening ACFs, and the strongest root in the series being a \((1-B)\), we differenced the data once to stationarize the realization. Following this, we identified the differenced data to be best represented by modelling with an \(AR(5)\) component based on Akaike Information Criterion (AIC) scores. While this was not the top selected by the algorithm, we tested several lag orders and determined this was the best suited selection and that there was not enough noise in the data to benefit from adding an MA term. A model was estimated to a \(ARMA(0,0)\) - practically, white noise - by the Bayesian Information Criterion score, but because AIC did not suggest this and we identified enough noise to justify modeling, we proceeded with an \(AR(5)\) component for the ARMA model.
The estimated residuals from the estimated \(ARMA(5,0)\) model at \(K=24\) lags produced a p-value\(=0.9405\) that we could not use to reject the null hypothesis, where the null is that the distribution of the residuals of the model are close enough to white noise that we cannot effectively distinguish the difference. Low variance (0.1173) also aided in our conclusion to assume the model represents the data well. We then attempted to identify a new model for the residuals using an AIC score, but the top model provided was a \(ARMA(0,0)\). This concluded our model selection analysis for this model.
Finally, we generated realizations using 99 data points (99 because the difference removed one from the original 100 points) and the estimated model parameters. The spectral density and autocorrelation function plots appeared to match closely to those of the original realization, sugged toward the long-run series mean.
In forecasting with the estimated \(ARIMA(5,1,0)\), the results were well placed with an Average Squared Error (ASE) of 0.2026529. While the model performed well over a 5-point forecast, this model would most likely tend back toward the long-run mean with a larger forecast horizon.
df <- stock_data()
## Parsed with column specification:
## cols(
## times = col_date(format = ""),
## open = col_double(),
## high = col_double(),
## low = col_double(),
## close = col_double(),
## volume = col_double(),
## symbol = col_character(),
## market = col_character()
## )
# take a sample of the data, analyze
plotts.sample.wge(df$low, arlimits=T)
ARIMA(p,d,q) Model
## $autplt
## [1] 1.0000000 0.9542096 0.8987323 0.8524345 0.8100849 0.7813310 0.7600881
## [8] 0.7371705 0.7123568 0.6828689 0.6476693 0.6153474 0.5864842 0.5600570
## [15] 0.5282542 0.4955941 0.4660246 0.4343222 0.4026243 0.3675647 0.3278754
## [22] 0.2900614 0.2568547 0.2354177 0.2239847 0.2178967
##
## $freq
## [1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14
## [15] 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
## [29] 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42
## [43] 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.50
##
## $db
## [1] 14.2537148 9.9541873 5.8173014 -0.5437278 -0.8235314
## [6] -1.3564258 1.4197169 -1.1176687 -7.3533813 -3.8782332
## [11] -0.5445071 -3.3199576 -5.9250727 -5.4172047 -2.9311702
## [16] -11.1750831 -6.3519213 -18.5926329 -18.7079883 -8.4217804
## [21] -9.3402655 -9.6710352 -9.7031584 -8.2549030 -14.1830462
## [26] -11.3498316 -12.9743644 -12.1531178 -16.0829501 -9.7837888
## [31] -12.9010407 -9.5737077 -12.8850744 -15.0135304 -10.3417581
## [36] -16.6716427 -13.1487825 -11.0083985 -16.2745553 -18.6304870
## [41] -17.7423184 -13.7665082 -16.0810139 -16.9046379 -13.1224062
## [46] -20.5533807 -18.2518162 -14.7903000 -19.2395748 -19.0265205
##
## $dbz
## [1] 10.6247182 9.9050329 8.7000676 7.0111987 4.8698055
## [6] 2.3951954 -0.1075597 -2.1509930 -3.4458827 -4.1937204
## [11] -4.7075065 -5.1324715 -5.5423563 -6.0346253 -6.7027509
## [16] -7.5765092 -8.5979157 -9.6409675 -10.5694523 -11.2993670
## [21] -11.8169224 -12.1628463 -12.4196726 -12.6865263 -13.0293947
## [26] -13.4394270 -13.8345951 -14.1178754 -14.2557593 -14.2960188
## [31] -14.3127453 -14.3564021 -14.4497949 -14.6054534 -14.8345314
## [36] -15.1426627 -15.5243385 -15.9631650 -16.4335783 -16.8977220
## [41] -17.3041080 -17.6055003 -17.7946958 -17.9201902 -18.0549021
## [46] -18.2464892 -18.4904595 -18.7368379 -18.9191710 -18.9861139
###########################################################
########################################################### ARMA/ARIMA
###########################################################
#factor.wge(df$low)
est.arma.wge(df$low, p=8, q=3) # the factor table for df$low provides one (1-B) represented as (1-0.9956B), a close-enough
##
## Coefficients of Original polynomial:
## -0.3746 0.7725 0.3401 -0.1343 -0.0358 0.3197 0.1747 -0.0838
##
## Factor Roots Abs Recip System Freq
## 1-0.9947B 1.0054 0.9947 0.0000
## 1+1.6828B+0.7314B^2 -1.1504+-0.2091i 0.8552 0.4714
## 1+0.9128B+0.6393B^2 -0.7139+-1.0269i 0.7996 0.3467
## 1-0.9172B+0.5823B^2 0.7875+-1.0474i 0.7631 0.1474
## 1-0.3093B 3.2335 0.3093 0.0000
##
##
## $phi
## [1] -0.37456915 0.77252645 0.34011944 -0.13429170 -0.03580418 0.31974554
## [7] 0.17466434 -0.08376272
##
## $theta
## [1] -1.5943772 -0.8563684 -0.2326950
##
## $res
## [1] 0.341382127 -0.339533981 0.372992944 0.045251115 0.282883734
## [6] 0.543863341 0.027683372 -0.055313770 -0.400705173 0.236964242
## [11] -0.066224077 -0.109657984 0.373268255 -0.028417015 0.268832218
## [16] 0.321186364 0.139491809 -0.060923904 -0.207501945 -0.509810842
## [21] 0.435881683 0.583222371 0.425193624 0.341428403 0.299644074
## [26] 0.165902559 0.268444218 0.150980721 0.417357079 -0.476663592
## [31] 0.367971194 -0.066208814 0.238830218 -0.357908092 -0.154882518
## [36] -0.260220286 -0.126198191 -0.314164842 0.194647972 0.075843457
## [41] 0.065674888 0.249249868 0.326605797 0.129696758 0.072656694
## [46] -0.081003028 -0.590131520 0.281198743 0.348383312 0.735499563
## [51] -0.244213897 -0.177292255 0.295402720 -0.397741559 0.068422622
## [56] 0.468362947 0.658157668 -0.226977155 -0.082270354 0.152294282
## [61] -0.788235363 0.327803117 -0.054959335 -0.071767723 0.259976113
## [66] 0.155307265 -0.019760354 0.873508519 0.739690436 0.089012192
## [71] -0.039686548 -0.439091020 0.389397691 -0.057910431 -0.435274770
## [76] 0.127990892 0.132624679 0.123202670 0.370572426 -0.291439393
## [81] 0.414131539 0.079728839 0.261378249 0.571631165 -0.110336678
## [86] 0.186149091 -0.454963807 -0.443809259 -0.007272801 0.584743178
## [91] 0.077730704 -1.018479869 0.311826735 0.461004571 0.090793997
## [96] -0.174995085 0.530696568 -0.456227949 0.815178176 -0.275805473
##
## $avar
## [1] 0.1295548
##
## $aic
## [1] -1.803651
##
## $aicc
## [1] -0.7413255
##
## $bic
## [1] -1.491031
##
## $se.phi
## [1] 0.56643405 0.22018656 0.51330869 0.16824213 0.06855007 0.02066278
## [7] 0.04665927 0.02784275
##
## $se.theta
## [1] 0.5715739 1.2523688 0.2789225
# approximation. Therefore, we difference the data once. Preliminary evidence suggesting differencing is useful is based
# on the specified wandering and the damping sample autocorrelations. When using an overfit table, there was a factor close
# to (1-b)^2, but it was not very significant (third most significant using estimated_arma <- est.arma.wge(dftrans, p=15, q=1))
# so we decided to only use a first difference.
dftrans <- artrans.wge(df$low, phi.tr=1)
ARIMA(p,d,q) Model
pacf(dftrans)
aic5.wge(dftrans, type="aic") # AIC = -2.001944. p=5, q=0
## ---------WORKING... PLEASE WAIT...
##
##
## Five Smallest Values of aic
## p q aic
## 17 5 1 -2.001944
## 13 4 0 -1.985397
## 14 4 1 -1.966194
## 16 5 0 -1.966140
## 10 3 0 -1.934893
estimated_arma <- est.arma.wge(dftrans, p=5, q=0)
##
## Coefficients of Original polynomial:
## 0.1227 -0.1183 -0.1454 -0.2587 -0.0292
##
## Factor Roots Abs Recip System Freq
## 1-1.0644B+0.6446B^2 0.8257+-0.9325i 0.8028 0.1347
## 1+0.8220B+0.3777B^2 -1.0882+-1.2098i 0.6146 0.3666
## 1+0.1198B -8.3479 0.1198 0.5000
##
##
estimated_arma$phi
## [1] 0.12269667 -0.11827203 -0.14543411 -0.25874556 -0.02916165
#estimated_arma <- est.arma.wge(dftrans, p=15, q=1)
ljung.wge(estimated_arma$res, p=5, q=0) # suggests there is no serial correlation in the residuals of the model; this is a good fit.
## Obs -0.001755173 -0.0006645789 -0.03189077 -0.0558602 -0.03745613 -0.006650248 -0.07239689 -0.1447362 -0.01711384 -0.08391131 0.05274902 -0.02483137 0.045087 -0.06707539 -0.004246384 0.07628534 0.08560867 -0.0009574519 -0.1290798 0.02213024 0.01599169 -0.07501175 -0.1093225 -0.02137378
## $test
## [1] "Ljung-Box test"
##
## $K
## [1] 24
##
## $chi.square
## [1] 11.57463
##
## $df
## [1] 19
##
## $pval
## [1] 0.9029941
estimated_arma$avar # 0.1240151
## [1] 0.1240151
mean(df$low) # 38.53802
## [1] 38.53802
# (1-B)(1-0.12269667B+0.11827203B^2+0.14543411B^3+0.25874556B^4-0.02916165B^5)*(X_t + 38.53802) = a_t, (sigma-hat_a)^2 = 0.1240151
# (1-1.12269667B+0.00442464B^2+0.26370614B^3+0.11331145B^4-0.28790721B^5+0.02916165B^6)*(X_t + 38.53802) = a_t, (sigma-hat_a)^2 = 0.1240151
######
###### Plotting residuals
# the residuals of the model do not appear correlated. The sample autocorrelation and partial autocorrelation plots indicate
# stationarity across all lags with all lags measured within the 5% limits
plotts.sample.wge(estimated_arma$res, arlimits=T)
ARIMA(p,d,q) Model
ARIMA(p,d,q) Model
## $autplt
## [1] 1.0000000000 -0.0017551730 -0.0006645789 -0.0318907670 -0.0558601955
## [6] -0.0374561274 -0.0066502482 -0.0723968930 -0.1447362269 -0.0171138352
## [11] -0.0839113138 0.0527490212 -0.0248313665 0.0450869990 -0.0670753909
## [16] -0.0042463842 0.0762853413 0.0856086721 -0.0009574519 -0.1290798025
## [21] 0.0221302437 0.0159916891 -0.0750117483 -0.1093225110 -0.0213737804
## [26] 0.0330170662
##
## $freq
## [1] 0.01010101 0.02020202 0.03030303 0.04040404 0.05050505 0.06060606
## [7] 0.07070707 0.08080808 0.09090909 0.10101010 0.11111111 0.12121212
## [13] 0.13131313 0.14141414 0.15151515 0.16161616 0.17171717 0.18181818
## [19] 0.19191919 0.20202020 0.21212121 0.22222222 0.23232323 0.24242424
## [25] 0.25252525 0.26262626 0.27272727 0.28282828 0.29292929 0.30303030
## [31] 0.31313131 0.32323232 0.33333333 0.34343434 0.35353535 0.36363636
## [37] 0.37373737 0.38383838 0.39393939 0.40404040 0.41414141 0.42424242
## [43] 0.43434343 0.44444444 0.45454545 0.46464646 0.47474747 0.48484848
## [49] 0.49494949
##
## $db
## [1] -8.833455090 0.263787556 -11.498115493 2.117874832 2.272952545
## [6] -9.309718287 5.018377394 0.289213348 -1.892315270 -11.308615271
## [11] 3.725194391 -0.611197010 -12.632295451 -0.590314051 3.046760336
## [16] -12.371079560 0.004883422 2.510605713 4.337557356 -3.178637232
## [21] -17.271334056 -1.077447786 -2.257551049 3.231804970 -1.965461358
## [26] -3.043185676 -1.825761971 -7.420727112 2.312103352 -1.619957827
## [31] -7.841066943 5.001502888 -2.103213653 0.521620232 2.860045258
## [36] -2.911236603 -9.089689687 4.807535654 -1.618776046 -13.746143879
## [41] -6.930503276 3.442666207 -8.907109791 -0.989371943 3.916436743
## [46] -14.309350664 0.883844166 0.342804251 -6.422777409
##
## $dbz
## [1] -2.1421139616 -1.5268677158 -0.7764179465 -0.0840477881 0.4450188370
## [6] 0.7709948539 0.8934228747 0.8371033041 0.6469486934 0.3863575660
## [11] 0.1329634098 -0.0354873769 -0.0674530562 0.0372285907 0.2260959001
## [16] 0.4205848653 0.5490294530 0.5673324474 0.4653779820 0.2642966345
## [21] 0.0072262109 -0.2555822359 -0.4825978402 -0.6484742078 -0.7390119560
## [26] -0.7427274437 -0.6522279590 -0.4757583702 -0.2446363728 -0.0055391248
## [31] 0.1965782737 0.3316246748 0.3873844148 0.3649753175 0.2727409570
## [36] 0.1240460268 -0.0586583169 -0.2366934318 -0.3546826477 -0.3582603902
## [41] -0.2277399300 -0.0007442644 0.2413219942 0.4108729200 0.4464568416
## [46] 0.3298807324 0.0947313908 -0.1721868052 -0.3517455314
par(mfrow=c(3,1))
plot(estimated_arma$res, ylab="ARIMA Residuals", xlab = "Trading Days", main = "ARIMA(5,1,1) Model Residuals over Time", lwd=2, type="l")
abline(h=mean(estimated_arma$res), col="blue")
acf(estimated_arma$res) # this seems to be white noise; all lags are within the 5% limits
pacf(estimated_arma$res) # this seems to be white noise; all lags are within the 5% limits
ARIMA(p,d,q) Model
aic5.wge(estimted_arma$res, type="aic") # White noise is the top model selected for the residuals, by AIC (ARMA(0,0))
## ---------WORKING... PLEASE WAIT...
##
##
## Error in aic calculation at 0 0
## Error in aic calculation at 0 1
## Error in aic calculation at 0 2
## Error in aic calculation at 1 0
## Error in aic calculation at 1 1
## Error in aic calculation at 1 2
## Error in aic calculation at 2 0
## Error in aic calculation at 2 1
## Error in aic calculation at 2 2
## Error in aic calculation at 3 0
## Error in aic calculation at 3 1
## Error in aic calculation at 3 2
## Error in aic calculation at 4 0
## Error in aic calculation at 4 1
## Error in aic calculation at 4 2
## Error in aic calculation at 5 0
## Error in aic calculation at 5 1
## Error in aic calculation at 5 2
## Five Smallest Values of aic
## p q aic
## 1 0 0 999999
## 2 0 1 999999
## 3 0 2 999999
## 4 1 0 999999
## 5 1 1 999999
######
######
######
###### Compare estimated model to differenced data
# We generated a model of 99 data points using the ARMA(5,1) identified by aic5 with a variance equal to that estimated from the
# differenced data, which is 0.1172604 (select to see origin highlighted above)
est_mod <- gen.aruma.wge(99, phi=estimated_arma$phi, theta=estimated_arma$theta, d=1, vara=estimated_arma$avar, sn=40)
ARIMA(p,d,q) Model
plotts.sample.wge(est_mod, arlimits=T) # estimated model
ARIMA(p,d,q) Model
## $autplt
## [1] 1.0000000 0.9319568 0.8616508 0.8115232 0.7837261 0.7798462 0.7686385
## [8] 0.7540206 0.7238039 0.6950334 0.6636351 0.6243891 0.5993385 0.5769297
## [15] 0.5519124 0.5366042 0.5296241 0.5194643 0.4818245 0.4413453 0.3916907
## [22] 0.3539872 0.3181363 0.2818305 0.2584268 0.2245354
##
## $freq
## [1] 0.01010101 0.02020202 0.03030303 0.04040404 0.05050505 0.06060606
## [7] 0.07070707 0.08080808 0.09090909 0.10101010 0.11111111 0.12121212
## [13] 0.13131313 0.14141414 0.15151515 0.16161616 0.17171717 0.18181818
## [19] 0.19191919 0.20202020 0.21212121 0.22222222 0.23232323 0.24242424
## [25] 0.25252525 0.26262626 0.27272727 0.28282828 0.29292929 0.30303030
## [31] 0.31313131 0.32323232 0.33333333 0.34343434 0.35353535 0.36363636
## [37] 0.37373737 0.38383838 0.39393939 0.40404040 0.41414141 0.42424242
## [43] 0.43434343 0.44444444 0.45454545 0.46464646 0.47474747 0.48484848
## [49] 0.49494949
##
## $db
## [1] 15.4662467 6.4612009 2.8156959 -6.5831271 -0.3491412
## [6] 0.6467012 -23.4462233 -11.9860652 -4.2541616 -2.3339334
## [11] -4.9973000 -0.5165929 -8.4672734 -5.9260513 -12.5070338
## [16] -1.5329287 -9.4007029 -4.2193607 -12.8919484 -17.0150048
## [21] -6.9836378 -14.0339519 -5.8623391 -15.3945590 -11.2233206
## [26] -24.0796617 -10.8749229 -18.3577480 -11.7127985 -7.5363064
## [31] -14.6647063 -14.4603439 -11.9075717 -18.9642429 -19.7115192
## [36] -13.9330289 -14.7859997 -12.3169328 -15.2897638 -15.2611997
## [41] -7.5246856 -25.4508518 -13.1571835 -18.5429171 -14.5183414
## [46] -11.9244393 -15.4696702 -20.2139032 -16.6394603
##
## $dbz
## [1] 10.3894114 9.7014613 8.5461122 6.9166900 4.8252742
## [6] 2.3519955 -0.2434927 -2.4200960 -3.6721480 -4.1238920
## [11] -4.2332485 -4.2725076 -4.3507834 -4.5391300 -4.8913942
## [16] -5.4201345 -6.0871771 -6.8234956 -7.5683596 -8.2978154
## [21] -9.0157797 -9.7227007 -10.4014424 -11.0315750 -11.6029838
## [26] -12.1026903 -12.4960138 -12.7466330 -12.8695657 -12.9469286
## [31] -13.0762896 -13.3067957 -13.6115217 -13.9016587 -14.0770531
## [36] -14.0895034 -13.9687564 -13.7936668 -13.6492815 -13.6034164
## [41] -13.6994736 -13.9536233 -14.3531696 -14.8590436 -15.4153465
## [46] -15.9631305 -16.4485233 -16.8195565 -17.0238291
plotts.sample.wge(df$low, arlimits=T) # estimated model
ARIMA(p,d,q) Model
## $autplt
## [1] 1.0000000 0.9542096 0.8987323 0.8524345 0.8100849 0.7813310 0.7600881
## [8] 0.7371705 0.7123568 0.6828689 0.6476693 0.6153474 0.5864842 0.5600570
## [15] 0.5282542 0.4955941 0.4660246 0.4343222 0.4026243 0.3675647 0.3278754
## [22] 0.2900614 0.2568547 0.2354177 0.2239847 0.2178967
##
## $freq
## [1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14
## [15] 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
## [29] 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42
## [43] 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.50
##
## $db
## [1] 14.2537148 9.9541873 5.8173014 -0.5437278 -0.8235314
## [6] -1.3564258 1.4197169 -1.1176687 -7.3533813 -3.8782332
## [11] -0.5445071 -3.3199576 -5.9250727 -5.4172047 -2.9311702
## [16] -11.1750831 -6.3519213 -18.5926329 -18.7079883 -8.4217804
## [21] -9.3402655 -9.6710352 -9.7031584 -8.2549030 -14.1830462
## [26] -11.3498316 -12.9743644 -12.1531178 -16.0829501 -9.7837888
## [31] -12.9010407 -9.5737077 -12.8850744 -15.0135304 -10.3417581
## [36] -16.6716427 -13.1487825 -11.0083985 -16.2745553 -18.6304870
## [41] -17.7423184 -13.7665082 -16.0810139 -16.9046379 -13.1224062
## [46] -20.5533807 -18.2518162 -14.7903000 -19.2395748 -19.0265205
##
## $dbz
## [1] 10.6247182 9.9050329 8.7000676 7.0111987 4.8698055
## [6] 2.3951954 -0.1075597 -2.1509930 -3.4458827 -4.1937204
## [11] -4.7075065 -5.1324715 -5.5423563 -6.0346253 -6.7027509
## [16] -7.5765092 -8.5979157 -9.6409675 -10.5694523 -11.2993670
## [21] -11.8169224 -12.1628463 -12.4196726 -12.6865263 -13.0293947
## [26] -13.4394270 -13.8345951 -14.1178754 -14.2557593 -14.2960188
## [31] -14.3127453 -14.3564021 -14.4497949 -14.6054534 -14.8345314
## [36] -15.1426627 -15.5243385 -15.9631650 -16.4335783 -16.8977220
## [41] -17.3041080 -17.6055003 -17.7946958 -17.9201902 -18.0549021
## [46] -18.2464892 -18.4904595 -18.7368379 -18.9191710 -18.9861139
#################################################################################################################################################
############### Confirmation that repeated estimated model visualizations of ACF and Spectral Density match that of original data ###############
#################################################################################################################################################
for(i in 1:2)
{
SpecDen2 = parzen.wge(gen.aruma.wge(99, phi=estimated_arma$phi, theta=estimated_arma$theta, d=1, vara=estimated_arma$avar), plot = T)
lines(SpecDen2$freq,SpecDen2$pzgram, lwd = 2, col = "red")
}
ARIMA(p,d,q) Model
ARIMA(p,d,q) Model
ARIMA(p,d,q) Model
ARIMA(p,d,q) Model
for(i in 1:2)
{
ACF2 = acf(gen.aruma.wge(99, phi=estimated_arma$phi, theta=estimated_arma$theta, d=1, vara=estimated_arma$avar), plot = T)
lines(ACF2$lag ,ACF2$acf, lwd = 2, col = "red")
}
ARIMA(p,d,q) Model
ARIMA(p,d,q) Model
ARIMA(p,d,q) Model
ARIMA(p,d,q) Model
#########################################################################################################
#########################################################################################################
#########################################################################################################
# after comparing sample spectral density and ACF plots, we comapred the AIC-estimated models to compare identified models
aic5.wge(dftrans)
## ---------WORKING... PLEASE WAIT...
##
##
## Five Smallest Values of aic
## p q aic
## 17 5 1 -2.001944
## 13 4 0 -1.985397
## 14 4 1 -1.966194
## 16 5 0 -1.966140
## 10 3 0 -1.934893
aic5.wge(est_mod2) # ARMA(5,0) and ARMA(5,1) are top in both models, indicating a potentially useful model..
## ---------WORKING... PLEASE WAIT...
##
##
## Error in aic calculation at 0 0
## Error in aic calculation at 0 1
## Error in aic calculation at 0 2
## Error in aic calculation at 1 0
## Error in aic calculation at 1 1
## Error in aic calculation at 1 2
## Error in aic calculation at 2 0
## Error in aic calculation at 2 1
## Error in aic calculation at 2 2
## Error in aic calculation at 3 0
## Error in aic calculation at 3 1
## Error in aic calculation at 3 2
## Error in aic calculation at 4 0
## Error in aic calculation at 4 1
## Error in aic calculation at 4 2
## Error in aic calculation at 5 0
## Error in aic calculation at 5 1
## Error in aic calculation at 5 2
## Five Smallest Values of aic
## p q aic
## 1 0 0 999999
## 2 0 1 999999
## 3 0 2 999999
## 4 1 0 999999
## 5 1 1 999999
######
######
######
###### Forecast and Measure Average Squared Errors (ASE)
# Model Forecast and ASE
# Forecasting the differenced data using the parameters estimated from it (we do not apply a d=1)
nonseasonal.forecast <- fore.aruma.wge(df$low, phi=estimated_arma$phi, theta=estimated_arma$theta, d=1,n.ahead=5, lastn=T)
ARIMA(p,d,q) Model
ARIMA_ASE = mean((df$low[(nrow(df)-5+1):nrow(df)] - nonseasonal.forecast$f)^2)
ARIMA_ASE# 0.2026529
## [1] 0.2026529
######
######
Following the univariate modeling of this time series, we decided to enhance our model with multivariate factors, both additive and multiplicative. As previously mentioned, based on the candlestick plot we created for exploratory data analysis, we were interested in exploring the relationship of the ranges of open to close and high to low prices as related to the low price. The features we engineered for this aided in describing the data and enhancing the models.
The multivariate models applied include Vector Autoregressive and Neural Network with Multi-Layer Perceptron (MLP) models, in addition to an ensemble model combining the forecasts of the univariate Signal-Plus-Noise and the multivariate MLP model. The ASE scores for all multivariate models outperformed all univariate models. However, an ensemble of Signal-Plus-Noise and the MLP model provides a safeguard of overfitting from a multivariate model given the dependency on additional variables that can change significantly very quickly, disrupting a complex forecast.
The vector Autoregressive (VAR) models used take all 75 points of the training data - including for the exogenous variables - to bulid a model. We forecasted 5 data points ahead from this trainin set of 95 out of 100 data points and then generated the overall ASE and the 5-point rolling window ASE for model comparison.
The autoregressive model identifed is an AR(2) with an AIC score of -2.4212.
df <- stock_data()
## Parsed with column specification:
## cols(
## times = col_date(format = ""),
## open = col_double(),
## high = col_double(),
## low = col_double(),
## close = col_double(),
## volume = col_double(),
## symbol = col_character(),
## market = col_character()
## )
plotts.sample.wge(df$volume, arlimits=T)
## $autplt
## [1] 1.000000000 0.332654213 -0.015323504 -0.185197647 0.005818653
## [6] 0.124607489 0.059217024 -0.058842730 -0.135327419 -0.128903104
## [11] -0.111554245 -0.136230829 -0.177090893 -0.102374463 -0.010295688
## [16] -0.012154962 -0.091622722 -0.032089912 -0.074665693 -0.032921963
## [21] -0.004581175 0.213392652 0.211975490 0.061996282 -0.059048222
## [26] -0.052510741
##
## $freq
## [1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14
## [15] 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
## [29] 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42
## [43] 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.50
##
## $db
## [1] -7.38499578 -0.95161030 3.39895369 3.13204413 7.25816141
## [6] -3.34022975 1.45696680 -3.16420812 0.98653170 -0.01510512
## [11] 0.74899492 -3.54602921 -7.04898470 6.69701287 0.21917900
## [16] 3.26040620 -2.30071218 5.83627330 0.12105751 4.40156709
## [21] 1.31507769 1.49404141 4.21104386 -7.07865344 -14.93552255
## [26] -1.90529423 -9.05704419 1.17291691 -2.44739663 -15.18817408
## [31] -3.82537165 -8.97778373 -1.31847009 -8.52554281 -9.96514339
## [36] -0.84871304 -2.04248679 -9.19693386 -5.37219836 -6.41835850
## [41] -0.84106671 -15.69187726 0.78348200 -7.78153910 -19.81844091
## [46] -1.92362217 -0.03109830 -1.67305303 -6.34494742 -2.36851032
##
## $dbz
## [1] 0.7617948 1.3396040 1.9342166 2.3252670 2.4227279 2.2272880
## [7] 1.8074853 1.2989121 0.8916181 0.7572102 0.9377718 1.3279105
## [13] 1.7766211 2.1767262 2.4818412 2.6857904 2.7968223 2.8178953
## [19] 2.7371907 2.5301127 2.1698520 1.6414415 0.9557012 0.1601924
## [25] -0.6606149 -1.4101839 -2.0301647 -2.5352613 -2.9884786 -3.4384161
## [31] -3.8748905 -4.2313776 -4.4336135 -4.4602264 -4.3559544 -4.1882068
## [37] -4.0026903 -3.8148228 -3.6258049 -3.4384625 -3.2600843 -3.0965714
## [43] -2.9492483 -2.8191207 -2.7129243 -2.6423490 -2.6148787 -2.6237376
## [49] -2.6467693 -2.6582701
# First Variable has the "time shift"
ccf(df$volume,df$low, type="correlation") # nothing too significant to warrant lagging volume
ccf(df$HiLo,df$low, type="correlation") # no strong cross-correlation
ccf(df$OpClo,df$low, type="correlation") # Open-Close does not require lagging.
df2 <- df[1:(nrow(df)-5),]
VARselect(df2$low, lag.max=6, season=NULL, exogen=data.frame(df2$volume, df2$HiLo, df2$OpClo), type="const") # AIC picked p=1 with AIC -2.3235
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 5 2 2 5
##
## $criteria
## 1 2 3 4 5
## AIC(n) -2.32345681 -2.42117979 -2.41376593 -2.40362716 -2.4365656
## HQ(n) -2.26710299 -2.35355521 -2.33487058 -2.31346105 -2.3351287
## SC(n) -2.18364578 -2.25340655 -2.21803048 -2.17992951 -2.1849057
## FPE(n) 0.09794606 0.08883496 0.08950683 0.09043349 0.0875214
## 6
## AIC(n) -2.42032626
## HQ(n) -2.30761863
## SC(n) -2.14070419
## FPE(n) 0.08897736
lsfit <- VAR(y=cbind(df2$low, df2$volume, df2$HiLo, df2$OpClo), p=2, type="const") # with p=5, this will estimate the coefficients for lag 5
preds <- predict(lsfit, n.ahead=5)
ASE <- mean((df$low[(nrow(df)-4):nrow(df)] - preds$fcst$y1[,1])^2)
ASE # 0.2282559
## [1] 0.2282559
plot(seq(1,100,1), df$low[1:100], type = "l", xlim = c(0,100), xlab = "Trading Days", ylab = "Share Prices", main = "Vector Autoregressive Modeling without Trend", lwd=2)
lines(seq(96,100,1), preds$fcst$y1[,1], type = "l", col = "red", lwd=2)
In the VAR model with trend, we introduce a time component into the model as an additive predictor term. The model identifed is an AR(2) with an AIC score of -2.4985.
df <- stock_data()
## Parsed with column specification:
## cols(
## times = col_date(format = ""),
## open = col_double(),
## high = col_double(),
## low = col_double(),
## close = col_double(),
## volume = col_double(),
## symbol = col_character(),
## market = col_character()
## )
plotts.sample.wge(df$volume, arlimits=T)
## $autplt
## [1] 1.000000000 0.332654213 -0.015323504 -0.185197647 0.005818653
## [6] 0.124607489 0.059217024 -0.058842730 -0.135327419 -0.128903104
## [11] -0.111554245 -0.136230829 -0.177090893 -0.102374463 -0.010295688
## [16] -0.012154962 -0.091622722 -0.032089912 -0.074665693 -0.032921963
## [21] -0.004581175 0.213392652 0.211975490 0.061996282 -0.059048222
## [26] -0.052510741
##
## $freq
## [1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14
## [15] 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
## [29] 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42
## [43] 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.50
##
## $db
## [1] -7.38499578 -0.95161030 3.39895369 3.13204413 7.25816141
## [6] -3.34022975 1.45696680 -3.16420812 0.98653170 -0.01510512
## [11] 0.74899492 -3.54602921 -7.04898470 6.69701287 0.21917900
## [16] 3.26040620 -2.30071218 5.83627330 0.12105751 4.40156709
## [21] 1.31507769 1.49404141 4.21104386 -7.07865344 -14.93552255
## [26] -1.90529423 -9.05704419 1.17291691 -2.44739663 -15.18817408
## [31] -3.82537165 -8.97778373 -1.31847009 -8.52554281 -9.96514339
## [36] -0.84871304 -2.04248679 -9.19693386 -5.37219836 -6.41835850
## [41] -0.84106671 -15.69187726 0.78348200 -7.78153910 -19.81844091
## [46] -1.92362217 -0.03109830 -1.67305303 -6.34494742 -2.36851032
##
## $dbz
## [1] 0.7617948 1.3396040 1.9342166 2.3252670 2.4227279 2.2272880
## [7] 1.8074853 1.2989121 0.8916181 0.7572102 0.9377718 1.3279105
## [13] 1.7766211 2.1767262 2.4818412 2.6857904 2.7968223 2.8178953
## [19] 2.7371907 2.5301127 2.1698520 1.6414415 0.9557012 0.1601924
## [25] -0.6606149 -1.4101839 -2.0301647 -2.5352613 -2.9884786 -3.4384161
## [31] -3.8748905 -4.2313776 -4.4336135 -4.4602264 -4.3559544 -4.1882068
## [37] -4.0026903 -3.8148228 -3.6258049 -3.4384625 -3.2600843 -3.0965714
## [43] -2.9492483 -2.8191207 -2.7129243 -2.6423490 -2.6148787 -2.6237376
## [49] -2.6467693 -2.6582701
# First Variable has the "time shift"
ccf(df$volume,df$low, type="correlation") # nothing too significant to warrant lagging volume
ccf(df$HiLo,df$low, type="correlation") # no strong cross-correlation
ccf(df$OpClo,df$low, type="correlation") # Open-Close does not require lagging.
df2 <- df[1:(nrow(df)-5),]
t <- 1:100
# Because of the white noise cross-correlation between all exogenous variables and the response variable, it's reasonable to expect only an AR(1) from VARselect. The model will quickly approximate the signal's mean without much sensitivity to change.
VARselect(df2$low, lag.max=6, season=NULL, exogen=data.frame(t[1:95], df2$volume, df2$HiLo, df2$OpClo), type="const") # AIC: p=1 with AIC -2.3158
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 2 2 2 2
##
## $criteria
## 1 2 3 4 5
## AIC(n) -2.3288269 -2.49848113 -2.47601445 -2.45354442 -2.45473227
## HQ(n) -2.2612024 -2.41958579 -2.38584835 -2.35210755 -2.34202465
## SC(n) -2.1610537 -2.30274568 -2.25231680 -2.20188456 -2.17511021
## FPE(n) 0.0974299 0.08223654 0.08411857 0.08604793 0.08596807
## 6
## AIC(n) -2.45522002
## HQ(n) -2.33124163
## SC(n) -2.14763575
## FPE(n) 0.08595343
lsfit <- VAR(y=cbind(df2$low, t[1:95], df2$volume, df2$HiLo, df2$OpClo), p=1, type="const") # with p=1, this will estimate the coefficients for lag 1
preds <- predict(lsfit, n.ahead=5)
ASE <- mean((df$low[(nrow(df)-4):nrow(df)] - preds$fcst$y1[,1])^2)
ASE # 0.1132188
## [1] 0.1132188
plot(seq(1,100,1), df$low[1:100], type = "l",xlim = c(0,100), xlab = "Trading Days", ylab = "Share Prices", main = "Vector Autoregressive Modeling with Trend", lwd=2)
lines(seq(96,100,1), preds$fcst$y1[,1], type = "l", col = "red", lwd=2)
Our VAR model with additive terms for trend, volume, the spread between high and low and open and close prices in addition to a multiplicative term for trend and its impact on volume outperformed all other VAR models we built in terms of ASE. Adding the multiplicative term in this model reduced the ASE to roughly half, from 0.2139 to 0.1296. The Autoregressive lag order identified for the model is an AR(2) with an AIC score of -2.4780. This model includes two univariate lags on the univariate input nodes one and two with four exogenous regressor lags across regressor nodes one and two.
df <- stock_data()
## Parsed with column specification:
## cols(
## times = col_date(format = ""),
## open = col_double(),
## high = col_double(),
## low = col_double(),
## close = col_double(),
## volume = col_double(),
## symbol = col_character(),
## market = col_character()
## )
# First Variable has the "time shift"
ccf(df$volume,df$low, type="correlation") # nothing too significant to warrant lagging volume
ccf(df$HiLo,df$low, type="correlation") # no strong cross-correlation
ccf(df$OpClo,df$low, type="correlation") # Open-Close does not require lagging.
df2 <- df[1:(nrow(df)-5),]
t <- 1:100
# Because of the white noise cross-correlation between all exogenous variables and the response variable, it's reasonable to expect only an AR(1) from VARselect. The model will quickly approximate the signal's mean without much sensitivity to change.
#VARselect(df2$low, lag.max=6, type="const", season=NULL, exogen=data.frame(t[96:100], df2$volume, df2$HiLo, df2$OpClo, t[96:100]*df2$volume)) # AIC: p=1 with AIC -2.3158
# trend added manually:
VARselect(df2$low, lag.max=6, season=NULL, exogen=data.frame(t[1:95], df2$volume, df2$HiLo, df2$OpClo, t[1:95]*df2$volume), type="const") # AIC: p=1 with AIC -2.3081
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 2 2 2 2
##
## $criteria
## 1 2 3 4 5
## AIC(n) -2.30801515 -2.47800080 -2.455533 -2.43306769 -2.43298718
## HQ(n) -2.22911981 -2.38783470 -2.354096 -2.32036006 -2.30900879
## SC(n) -2.11227971 -2.25430315 -2.203873 -2.15344563 -2.12540291
## FPE(n) 0.09949085 0.08395165 0.085877 0.08785085 0.08788582
## 6
## AIC(n) -2.43521003
## HQ(n) -2.29996088
## SC(n) -2.09966355
## FPE(n) 0.08772417
# trend added manually:
lsfit <- VAR(y=cbind(df2$low, t[1:95], df2$volume, df2$HiLo, df2$OpClo, t[1:95]*df2$volume), p=1, type="const") # with p=1, this will estimate the coefficients for lag 1
preds <- predict(lsfit, n.ahead=5)
ASE <- mean((df$low[(nrow(df)-4):nrow(df)] - preds$fcst$y1[,1])^2)
ASE # 0.1296242
## [1] 0.1296242
plot(seq(1,100,1), df$low[1:100], type = "l",xlim = c(0,100), xlab = "Trading Days", ylab = "Share Prices", main = "Vector Autoregressive Modeling with Trend and Volume Interaction", lwd=2)
lines(seq(96,100,1), preds$fcst$y1[,1], type = "l", col = "red", lwd=2)
Finally, we modeled with an interaction between trend and volume and trend and the range between open and close prices (we also attempted for interaction between trend and the range between high and low prices, but this hurt the model, potentially indicating there is greater variation in between the high and low prices than the open and close prices). Nonetheless, the ASE resulting from the forecast was higher than the multiplicative model only including an interaction between trend and trade volume. Therefore, the latter model is the one we selected as the top VAR candidate. As with the other VAR models, this model applied an identified second-order autoregressive with an AIC score of -2.4645.
df <- stock_data()
## Parsed with column specification:
## cols(
## times = col_date(format = ""),
## open = col_double(),
## high = col_double(),
## low = col_double(),
## close = col_double(),
## volume = col_double(),
## symbol = col_character(),
## market = col_character()
## )
plotts.sample.wge(df$volume, arlimits=T)
## $autplt
## [1] 1.000000000 0.332654213 -0.015323504 -0.185197647 0.005818653
## [6] 0.124607489 0.059217024 -0.058842730 -0.135327419 -0.128903104
## [11] -0.111554245 -0.136230829 -0.177090893 -0.102374463 -0.010295688
## [16] -0.012154962 -0.091622722 -0.032089912 -0.074665693 -0.032921963
## [21] -0.004581175 0.213392652 0.211975490 0.061996282 -0.059048222
## [26] -0.052510741
##
## $freq
## [1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14
## [15] 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
## [29] 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42
## [43] 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.50
##
## $db
## [1] -7.38499578 -0.95161030 3.39895369 3.13204413 7.25816141
## [6] -3.34022975 1.45696680 -3.16420812 0.98653170 -0.01510512
## [11] 0.74899492 -3.54602921 -7.04898470 6.69701287 0.21917900
## [16] 3.26040620 -2.30071218 5.83627330 0.12105751 4.40156709
## [21] 1.31507769 1.49404141 4.21104386 -7.07865344 -14.93552255
## [26] -1.90529423 -9.05704419 1.17291691 -2.44739663 -15.18817408
## [31] -3.82537165 -8.97778373 -1.31847009 -8.52554281 -9.96514339
## [36] -0.84871304 -2.04248679 -9.19693386 -5.37219836 -6.41835850
## [41] -0.84106671 -15.69187726 0.78348200 -7.78153910 -19.81844091
## [46] -1.92362217 -0.03109830 -1.67305303 -6.34494742 -2.36851032
##
## $dbz
## [1] 0.7617948 1.3396040 1.9342166 2.3252670 2.4227279 2.2272880
## [7] 1.8074853 1.2989121 0.8916181 0.7572102 0.9377718 1.3279105
## [13] 1.7766211 2.1767262 2.4818412 2.6857904 2.7968223 2.8178953
## [19] 2.7371907 2.5301127 2.1698520 1.6414415 0.9557012 0.1601924
## [25] -0.6606149 -1.4101839 -2.0301647 -2.5352613 -2.9884786 -3.4384161
## [31] -3.8748905 -4.2313776 -4.4336135 -4.4602264 -4.3559544 -4.1882068
## [37] -4.0026903 -3.8148228 -3.6258049 -3.4384625 -3.2600843 -3.0965714
## [43] -2.9492483 -2.8191207 -2.7129243 -2.6423490 -2.6148787 -2.6237376
## [49] -2.6467693 -2.6582701
# First Variable has the "time shift"
ccf(df$volume,df$low, type="correlation") # nothing too significant to warrant lagging volume
ccf(df$HiLo,df$low, type="correlation") # no strong cross-correlation
ccf(df$OpClo,df$low, type="correlation") # Open-Close does not require lagging.
df2 <- df[1:(nrow(df)-5),]
t <- 1:100
# Because of the white noise cross-correlation between all exogenous variables and the response variable, it's reasonable to expect only an AR(1) from VARselect. The model will quickly approximate the signal's mean without much sensitivity to change.
# find a good count for p for the AR(p) component of the vector autoregressive model. No indication above for lag was required, but :
VARselect(df2$low, lag.max=8, season=NULL, exogen=data.frame(t[1:95], df2$volume, df2$HiLo, df2$OpClo, t[1:95]*df2$volume, t[1:95]*df2$OpClo), type="const") # selects p=1 as the best value for AR(p)
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 2 2 2 2
##
## $criteria
## 1 2 3 4 5
## AIC(n) -2.2713589 -2.46459045 -2.44187917 -2.41889398 -2.41500677
## HQ(n) -2.1800536 -2.36187199 -2.32774754 -2.29334919 -2.27804882
## SC(n) -2.0446087 -2.20949651 -2.15844146 -2.10711249 -2.07488151
## FPE(n) 0.1032257 0.08510686 0.08708604 0.08914122 0.08952502
## 6 7 8
## AIC(n) -2.41422092 -2.40044609 -2.37762703
## HQ(n) -2.26584981 -2.24066181 -2.20642959
## SC(n) -2.04575189 -2.00363329 -1.95247046
## FPE(n) 0.08963886 0.09093375 0.09309378
# Build the model using the p identified for AR(p) and the training data. type="const" because trend is included
lsfit <- VAR(y=cbind(df2$low, t[1:95], df2$volume, df2$HiLo, df2$OpClo, t[1:95]*df2$volume, t[1:95]*df2$OpClo), p=1, type="const") # with p=1, this will estimate the coefficients for lag 1
preds <- predict(lsfit, n.ahead=5)
ASE <- mean((df$low[(nrow(df)-4):nrow(df)] - preds$fcst$y1[,1])^2)
ASE # 0.1703138
## [1] 0.1703138
plot(seq(1,100,1), df$low[1:100], type = "l",xlim = c(0,100), xlab = "Trading Days", ylab = "Share Prices", main = "Vector Autoregressive Modeling with Trend", lwd=2)
lines(seq(96,100,1), preds$fcst$y1[,1], type = "l", col = "red", lwd=2)
A multi-layered perceptron operates using a “feedforward” artificial neural network to pass forward information from the input nodes into a hidden layer of nodes that perform logical activation functions before processing into the output node. In the case of this analysis, lags were detected and applied on input to both the predictor regressors as well as the univariate response.
When adding additional predictor variables to the multivariate models, the forecasts converge less toward the long-run mean and more toward the behavior of the realization itself. With this said, the addition of further predictor variables - such as those through natural language processing of market sentiment analysis - can aid in further stabilizing useful multivariate models.
The Neural Network without trend model we produced generated the highest Average Squared Error of all our MLP models. However, the ASE was still highly competitive with all other models we produced. The ASE for this model is 0.1758194.
df <- stock_data()
## Parsed with column specification:
## cols(
## times = col_date(format = ""),
## open = col_double(),
## high = col_double(),
## low = col_double(),
## close = col_double(),
## volume = col_double(),
## symbol = col_character(),
## market = col_character()
## )
# First Variable has the "time shift"
ccf(df$volume,df$low, type="correlation") # nothing too significant to warrant lagging volume
ccf(df$HiLo,df$low, type="correlation") # no strong cross-correlation
ccf(df$OpClo,df$low, type="correlation") # Open-Close does not require lagging.
dfTrain <- df[1:(nrow(df)-5),]
lowTest <- df$low[(nrow(df)-4):nrow(df)] # forecast actuals
lowTrain <- ts(dfTrain$low) # a training set of the target minus the test forecast
volumeFull <- ts(df$volume) # the full dataset (train + test) of xreg
HiLoFull <- ts(df$HiLo) # the full dataset (train + test) of xreg
OpCloFull <- ts(df$OpClo) # the full dataset (train + test) of xreg
fit.mlp <- mlp(lowTrain, xreg=data.frame(volumeFull, HiLoFull, OpCloFull), reps = 15, comb = "mean", hd.auto.type="cv")
fit.mlp
## MLP fit with 6 hidden nodes and 15 repetitions.
## Univariate lags: (1,2)
## 2 regressors included.
## - Regressor 1 lags: (1,2)
## - Regressor 2 lags: (1,2,4)
## Forecast combined using the mean operator.
## MSE: 0.0382.
plot(fit.mlp)
fore.mlp <- forecast(fit.mlp, h=5, xreg=data.frame(volumeFull, HiLoFull, OpCloFull))
plot(fore.mlp)
ASE = mean((lowTest - fore.mlp$mean)^2)
ASE # 0.1758194
## [1] 0.1835049
After adding trend into our Neural Network modeling, we successfully lowered the ASE to 0.06421939.
df <- stock_data()
## Parsed with column specification:
## cols(
## times = col_date(format = ""),
## open = col_double(),
## high = col_double(),
## low = col_double(),
## close = col_double(),
## volume = col_double(),
## symbol = col_character(),
## market = col_character()
## )
t <- 1:nrow(df)
# First Variable has the "time shift"
ccf(df$volume,df$low, type="correlation") # nothing too significant to warrant lagging volume
ccf(df$HiLo,df$low, type="correlation") # no strong cross-correlation
ccf(df$OpClo,df$low, type="correlation") # Open-Close does not require lagging.
dfTrain <- df[1:(nrow(df)-5),]
lowTest <- df$low[(nrow(df)-4):nrow(df)] # forecast actuals
lowTrain <- ts(dfTrain$low) # a training set of the target minus the test forecast
volumeFull <- ts(df$volume) # the full dataset (train + test) of xreg
HiLoFull <- ts(df$HiLo) # the full dataset (train + test) of xreg
OpCloFull <- ts(df$OpClo) # the full dataset (train + test) of xreg
fit.mlp <- mlp(lowTrain, xreg=data.frame(t, volumeFull, HiLoFull, OpCloFull), reps = 15, comb = "mean", hd.auto.type="cv")
fit.mlp
## MLP fit with 1 hidden node and 15 repetitions.
## Univariate lags: (1,2)
## 3 regressors included.
## - Regressor 1 lags: (1)
## - Regressor 2 lags: (2)
## - Regressor 3 lags: (1,2)
## Forecast combined using the mean operator.
## MSE: 0.083.
plot(fit.mlp)
fore.mlp <- forecast(fit.mlp, h=5, xreg=data.frame(t, volumeFull, HiLoFull, OpCloFull))
plot(fore.mlp)
ASE = mean((lowTest - fore.mlp$mean)^2)
ASE # 0.06421939
## [1] 0.06459411
When we added a multiplicative term for trend and volume, our ASE lowered from 0.0642 to 0.0549. Although the Neural Nets that included a trend predictor were very close in terms of residual error, this model produced the lowest ASE. The final model
df <- stock_data()
## Parsed with column specification:
## cols(
## times = col_date(format = ""),
## open = col_double(),
## high = col_double(),
## low = col_double(),
## close = col_double(),
## volume = col_double(),
## symbol = col_character(),
## market = col_character()
## )
t <- 1:nrow(df)
# First Variable has the "time shift"
ccf(df$volume,df$low, type="correlation") # nothing too significant to warrant lagging volume
ccf(df$HiLo,df$low, type="correlation") # no strong cross-correlation
ccf(df$OpClo,df$low, type="correlation") # Open-Close does not require lagging.
dfTrain <- df[1:(nrow(df)-5),]
lowTest <- df$low[(nrow(df)-4):nrow(df)] # forecast actuals
lowTrain <- ts(dfTrain$low) # a training set of the target minus the test forecast
volumeFull <- ts(df$volume) # the full dataset (train + test) of xreg
HiLoFull <- ts(df$HiLo) # the full dataset (train + test) of xreg
OpCloFull <- ts(df$OpClo) # the full dataset (train + test) of xreg
fit.mlp <- mlp(lowTrain, xreg=data.frame(t, volumeFull, HiLoFull, OpCloFull, t*volumeFull), reps = 15, comb = "mean", hd.auto.type="cv")
fit.mlp
## MLP fit with 5 hidden nodes and 15 repetitions.
## Univariate lags: (1,2)
## 3 regressors included.
## - Regressor 1 lags: (1)
## - Regressor 2 lags: (2)
## - Regressor 3 lags: (1,2)
## Forecast combined using the mean operator.
## MSE: 0.0599.
plot(fit.mlp)
fore.mlp <- forecast(fit.mlp, h=5, xreg=data.frame(t, volumeFull, HiLoFull, OpCloFull, t*volumeFull))
plot(fore.mlp)
ASE = mean((lowTest - fore.mlp$mean)^2)
ASE # 0.05486796
## [1] 0.06564621
It is important to note that we did not identify an interaction between time and the spread between high and low prices to benefit the model, but we did find that interaction between time and the spread between open and close prices to be beneficial for reducing ASE. However, the overall reduction from this model increased the ASE from 0.0549 in the model using an interaction between only trend and volume to an ASE of 0.0571.
df <- stock_data()
## Parsed with column specification:
## cols(
## times = col_date(format = ""),
## open = col_double(),
## high = col_double(),
## low = col_double(),
## close = col_double(),
## volume = col_double(),
## symbol = col_character(),
## market = col_character()
## )
t <- 1:nrow(df)
# First Variable has the "time shift"
ccf(df$volume,df$low, type="correlation") # nothing too significant to warrant lagging volume
ccf(df$HiLo,df$low, type="correlation") # no strong cross-correlation
ccf(df$OpClo,df$low, type="correlation") # Open-Close does not require lagging.
dfTrain <- df[1:(nrow(df)-5),]
lowTest <- df$low[(nrow(df)-4):nrow(df)] # forecast actuals
lowTrain <- ts(dfTrain$low) # a training set of the target minus the test forecast
volumeFull <- ts(df$volume) # the full dataset (train + test) of xreg
HiLoFull <- ts(df$HiLo) # the full dataset (train + test) of xreg
High <- ts(df$high)
OpCloFull <- ts(df$OpClo) # the full dataset (train + test) of xreg
fit.mlp <- mlp(lowTrain, xreg=data.frame(t, volumeFull, High, OpCloFull, t*volumeFull, t*OpCloFull), reps = 15, comb = "mean", hd.auto.type="cv", difforder=NULL)
fit.mlp
## MLP fit with 7 hidden nodes and 15 repetitions.
## Univariate lags: (1)
## 3 regressors included.
## - Regressor 1 lags: (1)
## - Regressor 2 lags: (2)
## - Regressor 3 lags: (1,2)
## Forecast combined using the mean operator.
## MSE: 0.0654.
plot(fit.mlp)
fore.mlp <- forecast(fit.mlp, h=5, xreg=data.frame(t, volumeFull, High, OpCloFull, t*volumeFull, t*OpCloFull))
plot(fore.mlp)
ASE = mean((lowTest - fore.mlp$mean)^2)
ASE # 0.05711815
## [1] 0.06303533
To provide additional analysis for the best models of each category, we have provided below the Average Squared Error from a rolling 5-point forecast window.
df <- stock_data()
## Parsed with column specification:
## cols(
## times = col_date(format = ""),
## open = col_double(),
## high = col_double(),
## low = col_double(),
## close = col_double(),
## volume = col_double(),
## symbol = col_character(),
## market = col_character()
## )
#MLP
ts = df$low
batch_size = 95
start = 1
num_batches = length(ts)-batch_size+1
NN_ASEs = numeric(num_batches)
dfTrain <- df[1:(nrow(df)-5),]
lowTest <- df$low[(nrow(df)-4):nrow(df)] # forecast actuals
lowTrain <- ts(dfTrain$low) # a training set of the target minus the test forecast
volumeFull <- ts(df$volume) # the full dataset (train + test) of xreg
HiLoFull <- ts(df$HiLo) # the full dataset (train + test) of xreg
OpCloFull <- ts(df$OpClo) # the full dataset (train + test) of xreg
#Rolling ASE
for (i in 0:(num_batches-1))
{
fit.mlp <- mlp(ts(ts[i:(i+(batch_size-1))]), xreg=data.frame(volumeFull[i:(i+(batch_size-1))], HiLoFull[i:(i+(batch_size-1))], OpCloFull[i:(i+(batch_size-1))]), reps = 15, comb = "mean", hd.auto.type="cv")
fore.mlp <- forecast(fit.mlp, h=5, xreg=data.frame(volumeFull, HiLoFull, OpCloFull))
NN_ASEs[i+1] = mean((ts[(length(ts) - 4):length(ts)] - fore.mlp$mean)^2)
start = start+1
}
NN_ASEs
## [1] 0.1852031 0.1678351 0.1800357 0.1032628 0.1136498 0.0870527
mean(NN_ASEs) # 0.1468514
## [1] 0.1395065
df <- stock_data()
## Parsed with column specification:
## cols(
## times = col_date(format = ""),
## open = col_double(),
## high = col_double(),
## low = col_double(),
## close = col_double(),
## volume = col_double(),
## symbol = col_character(),
## market = col_character()
## )
df2 <- df[1:(nrow(df)-5),]
t = 1:100
trainingSize = 91
horizon = 5
ASEHolder_2 = numeric()
for( i in 1:5)
{
VARselect(df2$low[i:(i + (trainingSize - 1))], lag.max=6, season=NULL, exogen=data.frame(t[i:(i + (trainingSize - 1))], df2$volume[i:(i + (trainingSize - 1))], df2$HiLo[i:(i + (trainingSize - 1))], df2$OpClo[i:(i + (trainingSize - 1))], ((df2$volume[i:(i + (trainingSize - 1))]) * t[i:(i + (trainingSize - 1))])), type="const") # AIC picked p=2 with AIC -2.44854735
lsfit <- VAR(y=cbind(df2$low[i:(i + (trainingSize - 1))], exogen = data.frame(t[i:(i + (trainingSize - 1))], df2$volume[i:(i + (trainingSize - 1))], df2$HiLo[i:(i + (trainingSize - 1))], df2$OpClo[i:(i + (trainingSize - 1))], ((df2$volume[i:(i + (trainingSize - 1))]) * t[i:(i + (trainingSize - 1))]))), p=1, type="const")
preds <- predict(lsfit, n.ahead=5)
ASE_2 = mean((df$low[(trainingSize + i):(trainingSize + i + (horizon) - 1)] - preds$fcst$df2.low.i..i....trainingSize...1...[,1])^2)
ASEHolder_2[i] = ASE_2
}
rolling_ASE = mean(ASEHolder_2)
rolling_ASE # 0.1869798
## [1] 0.1869798
#Signal + Noise
df <- stock_data()
## Parsed with column specification:
## cols(
## times = col_date(format = ""),
## open = col_double(),
## high = col_double(),
## low = col_double(),
## close = col_double(),
## volume = col_double(),
## symbol = col_character(),
## market = col_character()
## )
df2 <- df[1:(nrow(df)-5),]
ts = df$low
t2 = 1:95
batch_size = 96
start = 1
t = 100
num_batches = length(ts)-batch_size+1
sigplus_ASEs = numeric(num_batches)
for (i in 0:(num_batches-1))
{
signoise.forecast <- fore.sigplusnoise.wge(df$low[i:(i+(batch_size-1))], max.p=2, n.ahead=5, limits=T, lastn=T, plot=T)
sigplus_ASEs[i+1] = mean((df$low[(nrow(df)-4):nrow(df)] - signoise.forecast$f)^2)
}
##
## Coefficients of Original polynomial:
## 1.1090 -0.3881
##
## Factor Roots Abs Recip System Freq
## 1-1.1090B+0.3881B^2 1.4288+-0.7317i 0.6230 0.0753
##
##
##
## Coefficients of Original polynomial:
## 1.1077 -0.3812
##
## Factor Roots Abs Recip System Freq
## 1-1.1077B+0.3812B^2 1.4531+-0.7156i 0.6174 0.0728
##
##
##
## Coefficients of Original polynomial:
## 1.1170 -0.3918
##
## Factor Roots Abs Recip System Freq
## 1-1.1170B+0.3918B^2 1.4254+-0.7213i 0.6260 0.0746
##
##
##
## Coefficients of Original polynomial:
## 1.0446 -0.3237
##
## Factor Roots Abs Recip System Freq
## 1-1.0446B+0.3237B^2 1.6135+-0.6971i 0.5689 0.0649
##
##
##
## Coefficients of Original polynomial:
## 1.0617 -0.3442
##
## Factor Roots Abs Recip System Freq
## 1-1.0617B+0.3442B^2 1.5423+-0.7257i 0.5867 0.0700
##
##
sigplus_ASEs
## [1] 0.13086079 0.15037944 0.20082465 0.08539453 0.09946937
rolling_ASE_sigplus = mean(sigplus_ASEs)
rolling_ASE_sigplus # 0.1333858
## [1] 0.1333858
### Rolling ASE for Neural Networks: Multi-Layered Perceptrons
#ARIMA
df <- stock_data()
## Parsed with column specification:
## cols(
## times = col_date(format = ""),
## open = col_double(),
## high = col_double(),
## low = col_double(),
## close = col_double(),
## volume = col_double(),
## symbol = col_character(),
## market = col_character()
## )
ts = df$low
batch_size = 95
start = 1
num_batches = length(ts)-batch_size+1
ASEs = numeric(num_batches)
for (i in 1:5)
{
forecasts = fore.aruma.wge(ts[start:(batch_size+i)], phi = c(0.1227,-.1183,.1454,-.2587,.029), d = 1, n.ahead = 5, lastn = TRUE, limits = FALSE)
ASEs[i+1] = mean((ts[(length(ts) - 4):length(ts)] - forecasts$f)^2)
}
rolling_ASE = mean(ASEs)
rolling_ASE #0.3935643
## [1] 0.3935643
#Ensemble
mlp_pred = as.numeric(fore.mlp$mean)
signoise = as.numeric(signoise.forecast$f)
en_pred = (mlp_pred + signoise)/2
ASE = mean((ts[(length(ts) - 4):length(ts)] - en_pred)^2)
ASE # 0.06966031
## [1] 0.06732777
rolling_ASEs_ensemble = (sigplus_ASEs + NN_ASEs)/2
rolling_ASEs_ensemble
## [1] 0.15803194 0.15910728 0.19043018 0.09432865 0.10655961 0.10895674
rolling_ASE_mean_ensemble = mean((sigplus_ASEs + NN_ASEs)/2)
rolling_ASE_mean_ensemble
## [1] 0.1362357